 critical point


Flat Channels to Infinity in Neural Loss Landscapes

Neural Information Processing Systems

The loss landscapes of neural networks contain minima and saddle points that may be connected in flat regions or appear in isolation. We identify and characterize a special structure in the loss landscape: channels along which the loss decreases extremely slowly, while the output weights of at least two neurons, $a_i$ and $a_j$, diverge to infinity, and their input weight vectors, $w_i$ and $w_j$, become equal to each other. At convergence, the two neurons implement a gated linear unit: $a_i \sigma(w_i \cdot x) + a_j \sigma(w_j \cdot x) \to c\,\sigma(w \cdot x) + (v \cdot x)\,\sigma'(w \cdot x)$. Geometrically, these channels to infinity are asymptotically parallel to symmetry-induced lines of critical points. Gradient flow solvers, and related optimization methods like SGD or ADAM, reach the channels with high probability in diverse regression settings, but without careful inspection they look like flat local minima with finite parameter values. Our characterization provides a comprehensive picture of these quasi-flat regions in terms of gradient dynamics, geometry, and functional interpretation. The emergence of gated linear units at the end of the channels highlights a surprising aspect of the computational capabilities of fully connected layers.
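The gated-linear-unit limit described in this abstract can be checked numerically. The sketch below is illustrative only and is not taken from the paper: it assumes a tanh activation and a hypothetical parameterization in which the output weights are c/2 + a and c/2 - a and the input weights are w + v/(2a) and w - v/(2a), so that the output weights diverge and the input weights merge as a grows. The names `two_neuron_output` and `gated_linear_unit` are made up for this example.

```python
import math

def dot(u, v):
    """Plain dot product of two equal-length lists."""
    return sum(ui * vi for ui, vi in zip(u, v))

def two_neuron_output(x, w, v, c, a):
    """Sum of two tanh neurons with output weights c/2 +/- a
    and input weights w +/- v/(2a) (assumed parameterization)."""
    w_i = [wk + vk / (2 * a) for wk, vk in zip(w, v)]
    w_j = [wk - vk / (2 * a) for wk, vk in zip(w, v)]
    a_i = c / 2 + a
    a_j = c / 2 - a
    return a_i * math.tanh(dot(w_i, x)) + a_j * math.tanh(dot(w_j, x))

def gated_linear_unit(x, w, v, c):
    """Limit function c*sigma(w.x) + (v.x)*sigma'(w.x) for sigma = tanh."""
    sigma = math.tanh(dot(w, x))
    sigma_prime = 1.0 - sigma ** 2  # derivative of tanh
    return c * sigma + dot(v, x) * sigma_prime

# Arbitrary illustrative values, not from the paper.
x = [0.3, -1.2]
w = [0.8, 0.5]
v = [1.5, -0.7]
c = 2.0

# Gap between the two-neuron output and the limit, for growing a.
errors = {
    a: abs(two_neuron_output(x, w, v, c, a) - gated_linear_unit(x, w, v, c))
    for a in (1.0, 10.0, 100.0, 1000.0)
}
for a, err in errors.items():
    print(f"a = {a:7.1f}   |two neurons - GLU| = {err:.3e}")
```

Under these assumptions the gap shrinks as a increases, since the difference of the two neurons acts as a finite-difference approximation of the derivative of the activation along v.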


A Tale of Two Symmetries: Exploring the Loss Landscape of Equivariant Models

Neural Information Processing Systems

Equivariant neural networks have proven to be effective for tasks with known underlying symmetries. However, optimizing equivariant networks can be tricky and best training practices are less established than for standard networks. In particular, recent works have found small training benefits from relaxing equivariance constraints. This raises the question: do equivariance constraints introduce fundamental obstacles to optimization? Or do they simply require different hyperparameter tuning?


A Closer Look at NTK Alignment: Linking Phase Transitions in Deep Image Regression

Neural Information Processing Systems

Deep neural networks trained with gradient descent exhibit varying rates of learning for different patterns. However, the complexity of fitting models to data makes direct elucidation of the dynamics of learned patterns challenging. To circumvent this, many works have opted to characterize phases of learning through summary statistics known as order parameters. In this work, we propose a unifying framework for constructing order parameters based on the Neural Tangent Kernel (NTK), in which the relationship with the data set is more transparent. In particular, we derive a local approximation of the NTK for a class of deep regression models (SIRENs) trained to reconstruct natural images. In so doing, we analytically connect three seemingly distinct phase transitions: the emergence of wave patterns in residuals (a novel observation), loss rate collapse, and NTK alignment. Our results provide a dynamical perspective on the observed biases of SIRENs, and deep image regression models more generally.


The Persistence of Neural Collapse Despite Low-Rank Bias

Neural Information Processing Systems

Neural collapse (NC) and its multi-layer variant, deep neural collapse (DNC), describe a structured geometry that occurs in the features and weights of trained deep networks. Recent theoretical work by Sukenik et al. using a deep unconstrained feature model (UFM) suggests that DNC is suboptimal under mean squared error (MSE) loss. They heuristically argue that this is due to low-rank bias induced by L2 regularization. In this work, we extend this result to deep UFMs trained with cross-entropy loss, showing that high-rank structures--including DNC--are not generally optimal. We characterize the associated low-rank bias, proving a fixed bound on the number of non-negligible singular values at global minima as network depth increases. We further analyze the loss surface, demonstrating that DNC is more prevalent in the landscape than other critical configurations, which we argue explains its frequent empirical appearance. Our results are validated through experiments in deep UFMs and deep neural networks.


From Saddle Points Toward Global Minima: A Newton-Type Method on Wasserstein Space

arXiv.org Machine Learning

We study the minimization of non-convex functionals over the Wasserstein space. While recent work has shown that perturbed Wasserstein gradient methods can avoid saddle points for benign landscapes, existing approaches remain essentially first-order and do not provide fast local convergence once the iterates enter a neighborhood of a global minimizer. We propose Wasserstein Saddle-Free Newton (WSFN), a second-order method that preconditions the Wasserstein gradient by a regularized square root of the squared Wasserstein Hessian. This construction preserves attraction toward directions of positive curvature while inducing repulsion along directions of negative curvature, thereby overcoming the tendency of standard Wasserstein Newton dynamics to be attracted to saddles. We also establish second-order sufficient optimality conditions on Wasserstein space for strict local minimality. Under regularity and benign landscape assumptions, we prove that WSFN escapes saddle regions and reaches an $\alpha$-neighborhood of a global minimizer in polynomial time, with improved dependence on saddle parameters compared with prior perturbed first-order methods. Once inside this neighborhood, we show that WSFN converges linearly in $L^2$-Wasserstein distance to a non-degenerate global minimizer. Finally, we present a particle-based implementation of the method.
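The preconditioning idea in this abstract has a simple finite-dimensional analogue. The sketch below is a Euclidean toy, not the Wasserstein method or the particle-based implementation from the paper. It assumes the hypothetical test function f(x, y) = x^2/2 - y^2/2 + y^4/4, which has a saddle at the origin and minima at y = +1 and y = -1. Its Hessian is diagonal, so the regularized square root of the squared Hessian reduces to sqrt(h^2 + eps) per coordinate. The step size, regularization, and starting point are arbitrary choices.

```python
import math

def grad(x, y):
    """Gradient of f(x, y) = x^2/2 - y^2/2 + y^4/4."""
    return x, -y + y ** 3

def hess_diag(x, y):
    """Diagonal Hessian entries of f (the off-diagonal entries are zero)."""
    return 1.0, -1.0 + 3.0 * y ** 2

def run(precondition, steps=60, lr=0.5, start=(1.0, 0.1)):
    """Damped preconditioned descent: z <- z - lr * g / precondition(h)."""
    x, y = start
    for _ in range(steps):
        gx, gy = grad(x, y)
        hx, hy = hess_diag(x, y)
        x -= lr * gx / precondition(hx)
        y -= lr * gy / precondition(hy)
    return x, y

EPS = 0.1  # regularization inside the square root (assumed value)

# Plain Newton divides by the signed curvature, so it is attracted to the saddle.
newton = run(lambda h: h)

# Saddle-free variant divides by sqrt(h^2 + eps), which is always positive,
# so the step stays a descent direction even where curvature is negative.
saddle_free = run(lambda h: math.sqrt(h * h + EPS))

print("Newton endpoint:     ", newton)
print("Saddle-free endpoint:", saddle_free)
```

Started near the saddle, the Newton iteration in this toy collapses onto the origin, while the saddle-free variant moves away along the negative-curvature direction and settles at the minimum with y = 1.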


Attention-based PCA

arXiv.org Machine Learning

We study attention mechanisms through the lens of a canonical unsupervised problem: principal component analysis (PCA). We show that, when trained on Gaussian data, both softmax and linear attention layers learn parameters that align with the principal eigenvectors of the covariance matrix, thereby establishing a direct and explicit connection with PCA. Our analysis covers both finite and infinite prompt regimes. In the infinite-prompt limit, we prove convergence to globally optimal solutions aligned with the leading spectral direction, while in the finite-prompt setting we show that the same behavior emerges up to sampling effects. We further extend the analysis to an in-context setting with spiked Wishart covariances, where attention successfully recovers the underlying signal direction. These results demonstrate that attention inherently performs PCA-like computations under unsupervised objectives, providing a theoretical foundation for its representation-learning capabilities.


f5ccb3ab757131a93586ef61ec701533-Supplemental-Conference.pdf

Neural Information Processing Systems

In this section, we compare the symmetric solutions found in erf [2] and ReLU networks [5] to our one-neuron solution (n = 1). The main difference is that both earlier studies constrain the search space to the symmetric subspace, whereas we first prove in Theorem 5.1 that the non-trivial critical points are contained in this subspace for a broad class of activation functions, including erf and ReLU. Solving the low-dimensional loss, we recover the same solution for ReLU and erf as in [2, 5] for unit-orthonormal teachers.


Should Under-parameterized Student Networks Copy or Average Teacher Weights?

Neural Information Processing Systems

Any continuous function f can be approximated arbitrarily well by a neural network with sufficiently many neurons k. We consider the case when f itself is a neural network with one hidden layer and k neurons. Approximating f with a neural network with n < k neurons can thus be seen as fitting an under-parameterized "student" network with n neurons to a "teacher" network with k neurons. As the student has fewer neurons than the teacher, it is unclear whether each of the n student neurons should copy one of the teacher neurons or rather average a group of teacher neurons. For shallow neural networks with erf activation function and for the standard Gaussian input distribution, we prove that "copy-average" configurations are critical points if the teacher's incoming vectors are orthonormal and its outgoing weights are unitary. Moreover, the optimum among such configurations is reached when n - 1 student neurons each copy one teacher neuron and the n-th student neuron averages the remaining k - n + 1 teacher neurons. For the student network with n = 1 neuron, we additionally provide a closed-form solution of the non-trivial critical point(s) for commonly used activation functions by solving an equivalent constrained optimization problem. Empirically, we find for the erf activation function that gradient flow converges either to the optimal copy-average critical point or to another point where each student neuron approximately copies a different teacher neuron. Finally, we find similar results for the ReLU activation function, suggesting that the optimal solution of under-parameterized networks has a universal structure.
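The copy-versus-average question for a single student neuron (n = 1) can be explored with a short computation. The sketch below is an illustration under stated assumptions, not the paper's closed-form solution. It uses the standard Gaussian identity E[erf(u.x) erf(v.x)] = (2/pi) arcsin(2 u.v / sqrt((1 + 2|u|^2)(1 + 2|v|^2))) for x ~ N(0, I), takes the loss to be the population squared error without a factor of 1/2, sets all teacher outgoing weights to 1, and fixes the student output weight at its least-squares optimum. The norm grid for the averaging direction is an arbitrary choice, and the function names are made up for this example.

```python
import math

def erf_kernel(u, v):
    """E[erf(u.x) erf(v.x)] for x ~ N(0, I), via the arcsin identity."""
    uv = sum(a * b for a, b in zip(u, v))
    uu = sum(a * a for a in u)
    vv = sum(b * b for b in v)
    return (2 / math.pi) * math.asin(2 * uv / math.sqrt((1 + 2 * uu) * (1 + 2 * vv)))

def one_neuron_loss(w, teachers):
    """Population squared error E[(a*erf(w.x) - sum_i erf(t_i.x))^2],
    with the output weight a set to its least-squares optimum."""
    teacher_energy = sum(erf_kernel(s, t) for s in teachers for t in teachers)
    overlap = sum(erf_kernel(w, t) for t in teachers)
    return teacher_energy - overlap ** 2 / erf_kernel(w, w)

k = 3  # number of teacher neurons (illustrative)
# Orthonormal teacher input vectors: the standard basis of R^k.
teachers = [[1.0 if i == j else 0.0 for j in range(k)] for i in range(k)]

# "Copy": the student neuron takes the input vector of one teacher neuron.
copy_loss = one_neuron_loss(teachers[0], teachers)

# "Average": the student points along the mean teacher direction;
# scan its norm r over a coarse grid and keep the best value.
norms = [0.25 * m for m in range(1, 41)]
average_loss = min(
    one_neuron_loss([r / math.sqrt(k)] * k, teachers) for r in norms
)

print(f"copy loss:    {copy_loss:.4f}")
print(f"average loss: {average_loss:.4f}")
```

In this setting the averaging direction gives a lower loss than copying a single teacher neuron, which is consistent with the n = 1 case of the copy-average result stated in the abstract.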